---
name: literature-filtering
description: Filter literature by publication year, journal, and predefined screening rules to produce inclusion/exclusion lists; use when conducting preliminary screening or systematic review screening to narrow the literature scope.
license: MIT
author: aipoch
---

> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use

- You need to quickly narrow a large bibliography by **publication year range** (e.g., 2015–2024).
- You must restrict results to a **target journal set** (e.g., a whitelist/blacklist of journals).
- You are running **preliminary screening** before full-text review and need traceable inclusion/exclusion decisions.
- You are conducting **systematic review screening** and must record consistent reasons for exclusion.
- You need standardized outputs (lists + logs) for collaboration, auditing, or downstream analysis.

## Key Features

- Rule-based filtering by **year**, **journal**, and **literature type/criteria**.
- **Journal name normalization** to match abbreviations and full names consistently.
- Structured recording of **exclusion reasons** for transparency and reproducibility.
- Support for **borderline/controversial item review** to improve consistency.
- Standardized outputs: **inclusion list**, **exclusion list**, and **screening statistics/summary**.

## Dependencies

- None (documentation-driven workflow).
- Optional template file:
  - `assets/screening_log_template.csv`

## Example Usage

The following example is a complete, runnable Python script that:

1. normalizes journal names,
2. filters by year and journal whitelist,
3. applies simple inclusion/exclusion rules, and
4. outputs inclusion/exclusion CSV files plus a screening log.
```python
#!/usr/bin/env python3
import csv
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# ----------------------------
# Configuration (edit as needed)
# ----------------------------
YEAR_MIN = 2018
YEAR_MAX = 2024

# Journal whitelist after normalization
JOURNAL_WHITELIST = {
    "journal of finance",
    "journal of financial economics",
    "review of financial studies",
}

# Abbreviation/full-name mapping (extend as needed)
JOURNAL_ALIASES = {
    "j. finan.": "journal of finance",
    "j finan": "journal of finance",
    "jfe": "journal of financial economics",
    "rev. financ. stud.": "review of financial studies",
    "rfs": "review of financial studies",
}

# Simple keyword-based screening rules (example)
INCLUDE_KEYWORDS = {"asset pricing", "corporate finance", "risk premium"}
EXCLUDE_KEYWORDS = {"editorial", "book review", "erratum"}

# ----------------------------
# Data model
# ----------------------------
@dataclass
class Record:
    id: str
    title: str
    year: int
    journal: str
    abstract: str

# ----------------------------
# Helpers
# ----------------------------
def normalize_journal(name: str, aliases: Dict[str, str]) -> str:
    """
    Normalize journal names:
    - lowercase
    - strip punctuation
    - collapse whitespace
    - map abbreviations to canonical full names
    """
    if not name:
        return ""
    raw = name.strip().lower()
    raw = re.sub(r"[^\w\s\.]", " ", raw)  # keep dots for alias keys like "j. finan."
    raw = re.sub(r"\s+", " ", raw).strip()
    # Try alias mapping on the dot-preserved version
    if raw in aliases:
        return aliases[raw]
    # Also try a dot-stripped variant
    nodot = raw.replace(".", "")
    if nodot in aliases:
        return aliases[nodot]
    # Canonicalize by removing dots and extra spaces
    canonical = re.sub(r"[\.]", "", raw)
    canonical = re.sub(r"\s+", " ", canonical).strip()
    return canonical

def contains_any(text: str, keywords: set) -> bool:
    t = (text or "").lower()
    return any(k in t for k in keywords)

def screen_record(r: Record) -> Tuple[bool, str]:
    """
    Returns (included, reason).
    Reasons are designed to be human-auditable.
    """
    if r.year < YEAR_MIN or r.year > YEAR_MAX:
        return False, f"Excluded: year out of range ({r.year})"
    norm_journal = normalize_journal(r.journal, JOURNAL_ALIASES)
    if norm_journal not in JOURNAL_WHITELIST:
        return False, f"Excluded: journal not in whitelist ({norm_journal})"
    text = f"{r.title}\n{r.abstract}"
    if contains_any(text, EXCLUDE_KEYWORDS):
        return False, "Excluded: matches exclusion keywords"
    if not contains_any(text, INCLUDE_KEYWORDS):
        return False, "Excluded: does not match inclusion keywords"
    return True, "Included: meets all criteria"

# ----------------------------
# I/O
# ----------------------------
def read_input_csv(path: str) -> List[Record]:
    """
    Expected columns: id,title,year,journal,abstract
    """
    out = []
    with open(path, "r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            out.append(
                Record(
                    id=row.get("id", "").strip(),
                    title=row.get("title", "").strip(),
                    year=int(row.get("year") or 0),  # treat a missing/blank year as 0
                    journal=row.get("journal", "").strip(),
                    abstract=row.get("abstract", "").strip(),
                )
            )
    return out

def write_csv(path: str, rows: List[Dict[str, str]], fieldnames: List[str]) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)

def main():
    input_path = "input_literature.csv"
    records = read_input_csv(input_path)

    included, excluded, log = [], [], []
    for r in records:
        norm_journal = normalize_journal(r.journal, JOURNAL_ALIASES)
        ok, reason = screen_record(r)
        log.append({
            "id": r.id,
            "title": r.title,
            "year": str(r.year),
            "journal_raw": r.journal,
            "journal_normalized": norm_journal,
            "decision": "include" if ok else "exclude",
            "reason": reason,
        })
        base = {
            "id": r.id,
            "title": r.title,
            "year": str(r.year),
            "journal": norm_journal,
        }
        (included if ok else excluded).append(base)

    write_csv("included.csv", included, ["id", "title", "year", "journal"])
    write_csv("excluded.csv", excluded, ["id", "title", "year", "journal"])
    write_csv(
        "screening_log.csv",
        log,
        ["id", "title", "year", "journal_raw", "journal_normalized", "decision", "reason"],
    )

    # Simple screening statistics
    stats = {
        "total": len(records),
        "included": len(included),
        "excluded": len(excluded),
    }
    print("Screening complete:", stats)
    print("Outputs: included.csv, excluded.csv, screening_log.csv")

if __name__ == "__main__":
    main()
```

Minimal input file example (`input_literature.csv`):

```csv
id,title,year,journal,abstract
1,Asset Pricing with Risk Premiums,2020,J. Finan.,We study asset pricing and the risk premium...
2,An Editorial Note,2021,Journal of Finance,This editorial summarizes...
3,Corporate Finance Evidence,2017,JFE,Empirical corporate finance results...
```

## Implementation Details

### 1. Rule Setting

- **Year rules**: define an inclusive range `[YEAR_MIN, YEAR_MAX]`.
- **Journal rules**:
  - Use a **whitelist** (or blacklist) of canonical journal names.
  - Apply **normalization** before matching to avoid false mismatches.
- **Screening criteria**:
  - Define explicit inclusion/exclusion criteria (e.g., topic, study type, population, method).
  - Ensure each exclusion has a **single primary reason** (or a controlled multi-reason scheme).

### 2. Journal Name Normalization

Recommended normalization steps (in order):

1. Convert to lowercase.
2. Remove/standardize punctuation and collapse whitespace.
3. Apply **abbreviation/full-name mapping** (e.g., `J. Finan.` → `Journal of Finance`).
4. Output a canonical form used for matching and reporting.

Key parameters:

- `JOURNAL_ALIASES`: dictionary for abbreviation/full-name mapping.
- Normalization policy choices:
  - Case sensitivity (typically disabled by lowercasing).
  - Punctuation handling (strip most punctuation; optionally preserve dots for alias keys).
  - Whitespace collapsing.

### 3. Execution of Screening

- Apply filters in a stable order to keep decisions consistent and auditable:
  1. Year range
  2. Journal match (after normalization)
  3. Inclusion/exclusion criteria
- Record a **decision** and **reason** for every record in a screening log.

### 4. Review and Consistency

- Flag borderline items (e.g., unclear abstracts, ambiguous journal names) for manual review.
- Keep a shared, versioned rule set (year range, journal list, alias map, criteria) to ensure consistent application across reviewers.

### 5. Output Organization

Produce at minimum:

- `included.csv`: records that pass all rules.
- `excluded.csv`: records that fail at least one rule.
- `screening_log.csv`: full trace with normalized journal and exclusion reason.
- Optional: screening statistics and a reason summary (counts by reason).

Reference formats and checkpoints can be aligned with `references/guide.md` if available.
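The optional reason summary can be sketched as a small helper that tallies screening-log rows by their recorded reason. This is one possible implementation, not part of the skill itself: the function name `summarize_reasons` and the inline sample rows are illustrative assumptions, shaped like `screening_log.csv` entries.

```python
from collections import Counter
from typing import Dict, List

def summarize_reasons(log_rows: List[Dict[str, str]]) -> Counter:
    """Count screening-log rows by their recorded reason string."""
    return Counter(row["reason"] for row in log_rows)

# Illustrative rows shaped like screening_log.csv entries
sample_log = [
    {"id": "1", "decision": "include", "reason": "Included: meets all criteria"},
    {"id": "2", "decision": "exclude", "reason": "Excluded: matches exclusion keywords"},
    {"id": "3", "decision": "exclude", "reason": "Excluded: year out of range (2017)"},
]

# Print reasons from most to least frequent
for reason, count in summarize_reasons(sample_log).most_common():
    print(f"{count:4d}  {reason}")
```

Because the reasons are controlled strings emitted by `screen_record`, counting them directly gives a consistent per-rule breakdown without any extra parsing.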